This assignment is based on the material covered in James et al. We will subsequently open up solutions to the problem sets.

Exercise 4.1

This question involves the use of multiple linear regression on the Auto data set. So load the data set from the ISLR package first.

If the following code chunk returns an error, you most likely have to install the ISLR package first. Use install.packages("ISLR") if this is the case.

# install.packages("ISLR")
data("Auto", package = "ISLR")

# 

# !! `auto <- try(data(Auto, package="ISLR")` doesn't work. Can't assign a dataset to a variable. 

Dataset description: Gas mileage, horsepower, and other information for 392 vehicles. > A data frame with 392 observations on the following 9 variables. Fields available: - mpg: miles per gallon - cylinders: Number of cylinders between 4 and 8 - displacement Engine displacement (cu. inches) - horsepower Engine horsepower - weight: Vehicle weight (lbs.) - acceleration: Time to accelerate from 0 to 60 mph (sec.) - year Model year (modulo 100) - origin: Origin of car (1. American, 2. European, 3. Japanese) - name Vehicle name

  1. Produce a scatterplot matrix which includes all of the variables in the data set.
par(mfrow=c(4,3))
pairs(Auto[1:3])

pairs(Auto[4:6])

pairs(Auto[7:9])

plot(Auto)

  1. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# can subset whith col number
print(which(colnames(Auto)=="name") )
## [1] 9
Auto1 <- Auto[c(-9)]

# can subset with names 
Auto2 <- Auto[c(-which(colnames(Auto)=="name"))]

# can subset with subset
Auto3 <- subset(Auto, select = !colnames(Auto) %in% "name")

# fail to  subset with filter
# `Auto4 <- Auto %>% dplyr::filter(UQ(!!as.name("name")=="name")))`

# can subset with select_if 
Auto4 <- Auto %>% select_if(is.numeric)

# check if 2 dfs are equal
identical(Auto1, Auto2)
## [1] TRUE
# another method to check if 2 dfs are equal
isTRUE(all.equal(Auto2, Auto3))
## [1] TRUE
all.equal(Auto2, Auto3)
## [1] TRUE
  1. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

    1. Is there a relationship between the predictors and the response?
    2. Which predictors appear to have a statistically significant relationship to the response?
    3. What does the coefficient for the year variable suggest?
# another way to reg on all var --- `myregreg <- with(Auto1, lm(mpg~., data = Auto1))`
myreg = lm(mpg ~ ., data=Auto1)
summary(myreg)
## 
## Call:
## lm(formula = mpg ~ ., data = Auto1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
  1. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
# do the plot
plot(myreg)

par(mfrow=c(2,2))
with(Auto1, plot(cylinders, mpg))
abline(lm(mpg~cylinders, Auto1))

with(Auto1, plot(displacement, mpg))
abline(lm(mpg~displacement, Auto1), col="red")

with(Auto1, plot(weight, mpg))   
abline(lm(mpg~weight, Auto1), col="red")

# intercept: -17.218435
# slope for acceleration : 0.080576
with(Auto1, plot(acceleration, mpg))
abline(lm(mpg~acceleration, Auto1))

# this snippet shows that the lines for SLR and MLR on the same variables are not the same. 
par(mfrow=c(2,2))
with(Auto1, plot(acceleration, mpg))
abline(lm(mpg~acceleration, Auto1))

coefficients(myreg)
##   (Intercept)     cylinders  displacement    horsepower        weight 
## -17.218434622  -0.493376319   0.019895644  -0.016951144  -0.006474043 
##  acceleration          year        origin 
##   0.080575838   0.750772678   1.426140495
with(Auto1, plot(acceleration, mpg))
abline(as.numeric(myreg$coefficients[1]), as.numeric(myreg$coefficients[6]), col="blue")
abline(lm(mpg ~ acceleration, Auto1), col="red")

colnames(Auto1)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"  
## [5] "weight"       "acceleration" "year"         "origin"
par(mfrow=c(2,1))
with(Auto1, plot(year, mpg))
abline(lm(mpg~year, Auto1))     

summary(lm(mpg~year, Auto1))
## 
## Call:
## lm(formula = mpg ~ year, data = Auto1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.0212  -5.4411  -0.4412   4.9739  18.2088 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -70.01167    6.64516  -10.54   <2e-16 ***
## year          1.23004    0.08736   14.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.363 on 390 degrees of freedom
## Multiple R-squared:  0.337,  Adjusted R-squared:  0.3353 
## F-statistic: 198.3 on 1 and 390 DF,  p-value: < 2.2e-16
with(Auto1, plot(origin, mpg))
abline(lm(mpg~origin, Auto1))

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant? > NT: Fields available: - mpg: miles per gallon - cylinders: Number of cylinders between 4 and 8 - displacement: Engine displacement (cu. inches) - horsepower: Engine horsepower - weight: Vehicle weight (lbs.) - acceleration: Time to accelerate from 0 to 60 mph (sec.) - year: Model year (modulo 100) - origin: Origin of car (1. American, 2. European, 3. Japanese) - name: Vehicle name

horse_acceleration <- Auto1$horsepower * Auto1$acceleration
plot(Auto1$acceleration, Auto1$horsepower)
abline(lm(horsepower~acceleration, Auto1), col="red")

sprintf("It seems that horsepower and accelaration has high negative correlation of %f, with a fitted prodictor value of %f ",cor(Auto1$horsepower, Auto1$acceleration), coef(lm(horsepower~acceleration, Auto1))[2])
## [1] "It seems that horsepower and accelaration has high negative correlation of -0.689196, with a fitted prodictor value of -9.615528 "
Autoplus <- Auto1 %>% dplyr::mutate(horsepower * acceleration)
# go, go ahead with regressing on that
lm.interact <-lm(mpg ~ ., data=Autoplus)
summary(lm.interact)
## 
## Call:
## lm(formula = mpg ~ ., data = Autoplus)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0329 -1.8177 -0.1183  1.7247 12.4870 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -32.499820   4.923380  -6.601 1.36e-10 ***
## cylinders                     0.083489   0.316913   0.263 0.792350    
## displacement                 -0.007649   0.008161  -0.937 0.349244    
## horsepower                    0.127188   0.024746   5.140 4.40e-07 ***
## weight                       -0.003976   0.000716  -5.552 5.27e-08 ***
## acceleration                  0.983282   0.161513   6.088 2.78e-09 ***
## year                          0.755919   0.048179  15.690  < 2e-16 ***
## origin                        1.035733   0.268962   3.851 0.000138 ***
## `horsepower * acceleration`  -0.012139   0.001772  -6.851 2.93e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.145 on 383 degrees of freedom
## Multiple R-squared:  0.841,  Adjusted R-squared:  0.8376 
## F-statistic: 253.2 on 8 and 383 DF,  p-value: < 2.2e-16

I note that the interaction term is highly statistically significant. In addition, adjusted r-squared has gone up from 0.8182238 to 0.8376468

sprintf("I note that the interaction term is highly statistically significant. In addition, adjusted r-squared has gone up from %f to %f", summary(myreg)$adj.r.squared, summary(lm.interact)$adj.r.squared)
## [1] "I note that the interaction term is highly statistically significant. In addition, adjusted r-squared has gone up from 0.818224 to 0.837647"
  1. Try a few different transformations of the variables, such as \(log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
lm.horse <- lm(mpg~horsepower, data=Auto1)
summary(lm.horse)$r.squared
## [1] 0.6059483
lm.log_horse <- lm(mpg~log(horsepower), data=Auto1)
summary(lm.log_horse)$r.squared
## [1] 0.6683348
par(mfrow = c(2,2))
plot(Auto1$horsepower, Auto1$mpg)
plot(log(Auto1$horsepower), Auto1$mpg)

Exercise 4.2

This question should be answered using the Carseats dataset from the ISLR package. So load the data set from the ISLR package first.

data("Carseats", package = "ISLR")
# colnames(select_if(Carseats,is.numeric))
# [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
# [6] "Price"       "Age"         "Education"  
  1. Fit a multiple regression model to predict Sales using Price, Urban, and US.
attach(Carseats)
lm.1 <- lm(Sales ~ Price + Urban + US)
summary(lm.1)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
# plot something 
plot(Sales, fitted.values(lm(Sales~Price)))
abline(lm(Sales~Price), col="blue")
abline(lm(Sales~residuals(lm(Sales~Price))), col="red")

# data satisfies homogeneity assumption
plot(residuals(lm(Sales~Price)))

  1. Provide an interpretation of each coefficient in the model. Be careful – some of the variables in the model are qualitative!
is.numeric(c(US, Price, Sales, Urban))
## [1] TRUE
# import script to use my own function
source("myfunctions.R")
sapply(c(1:length(colnames(lm.1$model))), describe_coef, regression=lm.1)
## [1] "Intercept of model is 13.043469"                                           
## [2] "Coefficient 'Price' has a slope of -0.054459. It is a character variable. "
## [3] "Coefficient 'Urban' has a slope of -0.021916. It is a character variable. "
## [4] "Coefficient 'US' has a slope of 1.200573. It is a character variable. "

For every 1 dollar increase in price, given the same car type (urban or not) and car origing (U.S. made or not), sales decreases by 0.05 units. Sales for urban cars is less than the non-urban ones by 0.02 units holding car type and car price constant. In addition, U.S. made cars sell better, with a sales performance of 1.2 units more than the non-U.S. made models.

  1. Write out the model in equation form, being careful to handle the qualitative variables properly. \[ \widehat{Sales} = \widehat{13.043469} + \widehat{-0.054459}*Price + \widehat{-0.021916}*Urban + \widehat{1.200573}*US + \epsilon\]

  2. For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j =0\)? Predictor Urban has a t-statistic of -0.081 and a p-value of 0.936. It is statistically highly insignificant. I can reject the null hypothesis \(H_0 : \beta_Urban = 0\) at very low confidence interval.
  3. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm.small <- lm(Sales ~ Price + US)
summary(lm.small)
## 
## Call:
## lm(formula = Sales ~ Price + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
  1. How well do the models in (a) and (e) fit the data?
sprintf("R-squared for this model is %s, the model can explain %s percent of variation in Sales. ", summary(lm.small)$r.squared, round(x = summary(lm.small)$r.squared*100, digits = 2))
## [1] "R-squared for this model is 0.239262888426786, the model can explain 23.93 percent of variation in Sales. "
  1. Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(object = lm.small, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

From the data, it can be said that the true effect of price on car sales will be within -0.0648 and -0.0442, 95% of the time, if the linear regression is performed on an infinite number of sample data. The true effect of the car being U.S. made is within 0.6915 and 1.7078 given the same conditions.

  1. Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow=c(2,2))
plot(Price, Sales)
plot(US, Sales, xlab = "US", ylab="Sales")

The graphs provide graphical evidence that there are outliers in both Price and US variables.

Exercise 4.3 (Optional)

In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.

  1. Using the rnorm() function, create a vector, x, containing 100 observations drawn from a \(N(0,1)\) distribution. This represents a feature, X.
set.seed(1)
X <- rnorm(100, mean = 0, sd = 1)
  1. Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a \(N(0,0.25)\) distribution i.e. a normal distribution with mean zero and variance 0.25.
set.seed(1)
eps <- rnorm(100, mean = 0, sd = 0.25)
  1. Using x and eps, generate a vector y according to the model \[Y = -1 + 0.5X + \epsilon\] What is the length of the vector y? What are the values of \(\beta_0\) and \(\beta_1\) in this linear model?
Y = -1 + 0.5 * X + eps
hist(Y)

  1. Create a scatterplot displaying the relationship between x and y. Comment on what you observe.
plot(X, Y)
abline(lm(Y~X))

sprintf("Y and X does not seem to be a strong relationship. In addition, it has a weak correlation of %f. ", cor(X, Y))
## [1] "Y and X does not seem to be a strong relationship. In addition, it has a weak correlation of 1.000000. "
  1. Fit a least squares linear model to predict y using x. Comment on the model obtained. How do \(\hat{\beta}_0\) and \(\hat{\beta}_1\) compare to \(\beta_0\) and \(\beta_1\)?
lm.2 <- lm(Y~X)

sprintf("The model has an r-squared of %f, with a F-statistic of %f", summary(lm.2)$r.squared , summary(lm.2)$fstatistic[1])
## Warning in summary.lm(lm.2): essentially perfect fit: summary may be
## unreliable

## Warning in summary.lm(lm.2): essentially perfect fit: summary may be
## unreliable
## [1] "The model has an r-squared of 1.000000, with a F-statistic of 648726956315797987188980599226368.000000"
sprintf("The predictor X is not statistically significant with a t-statistic of only 0.237 and a p-value of 0.813. We can be highly confident that the predictor for X, $/beta_0 = 0$")
## [1] "The predictor X is not statistically significant with a t-statistic of only 0.237 and a p-value of 0.813. We can be highly confident that the predictor for X, $/beta_0 = 0$"
  1. Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.
plot(Y, X)
abline(lm(Y~X), col="blue")
abline(Y,X, col="red")

# TODO add legend: `legend(0.2,0.3, "hello" )`
# TODO how to plot the population regression line?
  1. Now fit a polynomial regression model that predicts \(y\) using \(x\) and \(x^2\). Is there evidence that the quadratic term improves the model fit? Explain your answer.
X2 <- X^2
lm.3 <- lm(Y~X + X2)
sprintf("The new r-squared and adjusted r-squared is %f and %f respectively. This is greater than the old rsquared with just a degree-1 linear fit of %f and %f. ", summary(lm.3)$r.squared, summary(lm.3)$adj.r.squared, summary(lm.2)$r.squared, summary(lm.2)$adj.r.squared)
## Warning in summary.lm(lm.3): essentially perfect fit: summary may be
## unreliable

## Warning in summary.lm(lm.3): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(lm.2): essentially perfect fit: summary may be
## unreliable

## Warning in summary.lm(lm.2): essentially perfect fit: summary may be
## unreliable
## [1] "The new r-squared and adjusted r-squared is 1.000000 and 1.000000 respectively. This is greater than the old rsquared with just a degree-1 linear fit of 1.000000 and 1.000000. "
# can adj.r.squared be negative??? 
plot(X, Y)
abline(lm.2, col="red")
abline(lm.3, col="blue")
## Warning in abline(lm.3, col = "blue"): only using the first two of 3
## regression coefficients

  1. Repeat (a)-(f) after modifying the data generation process in such a way that there is less noise in the data. The model should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term \(\epsilon\) in (b). Describe your results.
set.seed(1)
eps_lessnoise <- rnorm(100, mean = 0, sd = 0.2)
Y_lessnoise <- -1 + 0.5*X + eps_lessnoise

lm.4 <- lm(Y_lessnoise~X)
lm.5 <- lm(Y_lessnoise ~ X + X2)

sprintf("After having reduced the error term, it is found that the new r-squared and adjusted r-squared for degree-1 linear model is %f and %f respectively. Mean while, the polynomial fit has an rsquared of %f and adjusted r-squared of %f. Therefore, linear fit looks better for this simulated data.", summary(lm.4)$r.squared, summary(lm.4)$adj.r.squared, summary(lm.5)$r.squared, summary(lm.5)$adj.r.squared)
## Warning in summary.lm(lm.4): essentially perfect fit: summary may be
## unreliable

## Warning in summary.lm(lm.4): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(lm.5): essentially perfect fit: summary may be
## unreliable

## Warning in summary.lm(lm.5): essentially perfect fit: summary may be
## unreliable
## [1] "After having reduced the error term, it is found that the new r-squared and adjusted r-squared for degree-1 linear model is 1.000000 and 1.000000 respectively. Mean while, the polynomial fit has an rsquared of 1.000000 and adjusted r-squared of 1.000000. Therefore, linear fit looks better for this simulated data."
plot(X, Y_lessnoise)
abline(lm.4, col="red")

print("Graphically, this looks like a linear fit. ")
## [1] "Graphically, this looks like a linear fit. "
  1. Repeat (a)-(f) after modifying the data generation process in such a way that there is more noise in the data. The model should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term \(\epsilon\) in (b). Describe your results.
set.seed(1)
eps_morenoise <- rnorm(100, mean = 0, sd = 5)
Y_morenoise <- -1 + 0.5*X + eps_morenoise

lm.6 <- lm(Y_morenoise~X)
lm.7 <- lm(Y_morenoise ~ X + X2)

sprintf("After having increased variation in the error term, it is found that the new r-squared and adjusted r-squared for degree-1 linear model is %f and %f respectively. Mean while, the polynomial fit has an rsquared of %f and adjusted r-squared of %f. Therefore, polynomial fit looks better for this simulated data.", summary(lm.6)$r.squared, summary(lm.6)$adj.r.squared, summary(lm.7)$r.squared, summary(lm.7)$adj.r.squared)
## Warning in summary.lm(lm.6): essentially perfect fit: summary may be
## unreliable

## Warning in summary.lm(lm.6): essentially perfect fit: summary may be
## unreliable
## Warning in summary.lm(lm.7): essentially perfect fit: summary may be
## unreliable

## Warning in summary.lm(lm.7): essentially perfect fit: summary may be
## unreliable
## [1] "After having increased variation in the error term, it is found that the new r-squared and adjusted r-squared for degree-1 linear model is 1.000000 and 1.000000 respectively. Mean while, the polynomial fit has an rsquared of 1.000000 and adjusted r-squared of 1.000000. Therefore, polynomial fit looks better for this simulated data."
# TODO adjusted r-squared is again negative. 
  1. What are the confidence intervals for \(\beta_0\) and \(\beta_1\) based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.
confint(lm.2)
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
##             2.5 % 97.5 %
## (Intercept) -1.00  -1.00
## X            0.75   0.75
confint(lm.4)
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
##             2.5 % 97.5 %
## (Intercept)  -1.0   -1.0
## X             0.7    0.7
confint(lm.6)
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
##             2.5 % 97.5 %
## (Intercept)  -1.0   -1.0
## X             5.5    5.5
sprintf("The range of confident interval for X in ORIGINAL model is %f ", range_of_confint(lm.4))
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable

## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
## [1] "The range of confident interval for X in ORIGINAL model is 1.700000 "
sprintf("The range of confident interval for X in the NOISIER model is %f ", range_of_confint(lm.2))
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable

## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
## [1] "The range of confident interval for X in the NOISIER model is 1.750000 "
sprintf("The range of confident interval for X in the LESS NOISY model is %f ", range_of_confint(lm.6))
## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable

## Warning in summary.lm(object): essentially perfect fit: summary may be
## unreliable
## [1] "The range of confident interval for X in the LESS NOISY model is 6.500000 "